Nick Schaap
CMSC320 | Spring 2022
Growing up in Montgomery County, MD, I have first-hand knowledge of some of the biggest issues faced by my county's school district. MCPS is the 14th largest public school district in the US by enrollment. It is also extremely diverse: its 165,000+ students come from more than 150 countries and speak more than 150 languages (https://www.montgomeryschoolsmd.org/about/). One of the biggest issues faced by MCPS is the achievement gap. Currently, there exist large discrepancies in educational outcomes across MCPS's 26 high school clusters, largely attributable to external factors such as income level and neighborhood environment. I wanted to examine the correlation between crime and key school profile information, such as graduation rate, attendance rate, dropout rate, free and reduced-price meals enrollment (FARMS), and student-to-staff ratio, to see if MCPS is providing more staff support in underprivileged areas.
import folium
from folium.plugins import HeatMap
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from fuzzywuzzy import fuzz
import geopy.distance
I wanted to collect graduation rate and other profile information for each high school, as well as crime data. The crime data dated as far back as mid-2016. I used the MCPS website to collect "Schools at a Glance" information, which was available up until 2021. Thus, I retrieved data from both sources for 2016-2021 and will focus my analysis on this time frame, where data from both sources overlaps. The data is tabulated in PDF form, so I copied and pasted it into an Excel sheet and exported it as a CSV. Then I used pandas to do some data wrangling and parse the values for each school and year. I also used an open data API to retrieve school latitude and longitude data so that I could filter crime data near certain schools. I matched the API data to the data retrieved from the MCPS website using a library called fuzzywuzzy: I sorted the school info rows based on how closely each school name matched a school name value from the MCPS data, then took the closest match. This accounted for any small discrepancies in how school names were formatted.
# School profile information
# Read in the csv file
school_data = pd.read_csv('school_data.csv')
# Melt column data into individual row entries
school_data = school_data.melt(var_name="Year", value_name="Value")
for (i, row) in school_data.iterrows():
    values = row['Value'].split()
    columns = ['FARMS', 'ATTENDANCE', 'GRADUATION', 'DROPOUT', 'MOBILITY', 'STUD/STAFF RATIO', 'ENG CLASS SIZE', 'OTHER CLASS SIZE']
    # Parsing individual metric values: leading tokens form the school name, trailing tokens map to the metric columns
    school_data.at[i, 'High School'] = " ".join(values[0:(len(values) - len(columns))])
    for j in range(0, len(columns)):
        school_data.at[i, columns[-(j+1)]] = values[-(j+1)]
school_data = school_data.drop(columns=['Value'])
# Removing <,> quantifiers. Some of the profile information is intentionally masked at upper
# boundaries for student privacy reasons.
def containsNumber(value):
    for character in value:
        if character.isdigit():
            return True
    return False
# Parse each metric column: strip a leading '<' or '>' mask, coerce to float, and drop unparseable rows
for col in ['GRADUATION', 'FARMS', 'DROPOUT', 'ATTENDANCE', 'STUD/STAFF RATIO']:
    school_data[col] = school_data[col].apply(lambda x: float(x) if x[0].isdigit() else float(x[1:]) if containsNumber(x) else -1)
    school_data = school_data[school_data[col] > 0]
school_data = school_data.astype({'Year': 'int'})
school_data = school_data[school_data['Year'] >= 2016]
# School geographical information
schools_info = pd.read_json('https://data.montgomerycountymd.gov/resource/7ycz-azby.json')
for (i, row) in school_data.iterrows():
    school_name = row["High School"]
    find_school = schools_info.copy(deep=True)
    # Finding the closest match to the API data using fuzzywuzzy partial_ratio on the school name
    find_school['match_score'] = find_school.apply(lambda x: fuzz.partial_ratio(school_name, x["school_name"]), axis=1)
    # Sorting based on similarity scores and choosing the row index with the highest similarity as the matching index
    find_school = find_school.sort_values(by='match_score', ascending=False).head(1).index.values[0]
    school_data.at[i, "school_info_index"] = find_school
school_data = school_data.astype({'school_info_index':'int'})
school_data
# Filtering the API columns to those of interest/value
schools_info = schools_info[['school_name', 'zip_code', 'city', 'address', 'latitude', 'longitude']]
schools_info
# Lookup function that returns the API data row corresponding to a particular school
def get_school_info(school_name):
    return schools_info[schools_info["school_name"] == school_name].head(1).reset_index().iloc[0,:]
I obtained crime data from the Montgomery County, Maryland open data archive. This data dates back to 2013 and contains over 294,000 crime records. The API limits responses to 1,000 records by default, so I passed in query parameters to override this limit and to order the records by the start_date of the logged crime event. This let me avoid writing my own pagination logic to retrieve all records of the dataset, at the cost of a long response time (~30 seconds).
crime_data = pd.read_json("https://data.montgomerycountymd.gov/resource/icn6-v9z3.json?$limit=300000&$order=start_date")
crime_data
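If the single large request ever became impractical, the same records could be pulled page by page with Socrata's `$limit` and `$offset` query parameters instead. A minimal sketch, with the request function passed in as a parameter (the page size is illustrative; in practice `fetch_page` would wrap `pd.read_json` against the county endpoint):

```python
import pandas as pd

def fetch_all_pages(fetch_page, page_size=50000):
    """Page through a dataset until an empty page is returned.

    fetch_page(limit, offset) should issue the actual request, e.g.
    pd.read_json(f"{url}?$limit={limit}&$offset={offset}&$order=start_date").
    Ordering by a stable column keeps pages consistent between requests.
    """
    pages, offset = [], 0
    while True:
        page = fetch_page(page_size, offset)
        if page.empty:
            break
        pages.append(page)
        offset += page_size
    return pd.concat(pages, ignore_index=True)
```

A quick way to exercise the paging logic without the network is to slice a local DataFrame in place of the real request:

```python
records = pd.DataFrame({"incident_id": range(10)})
fake_fetch = lambda limit, offset: records.iloc[offset:offset + limit]
all_records = fetch_all_pages(fake_fetch, page_size=4)  # 3 pages: 4 + 4 + 2 rows
```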
The Montgomery County Open Data Crime API returns several useful fields for each recorded crime, including nibrs_code, crimename1, crimename2, crimename3, start_date, latitude, and longitude. I plan to use the latitude and longitude to filter crime events near various schools. The nibrs_code is another useful feature that lets me filter the crime data for specific categories of offenses as defined by the National Incident-Based Reporting System standards.
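As a small illustration of that filtering (the rows below are made up, not real county records), a set of NIBRS codes can be matched in one pass by joining them into a single regex pattern, which is the same idiom used later in the analysis:

```python
import pandas as pd

# Made-up crime records for illustration; real data comes from the county API
crimes = pd.DataFrame({
    "incident_id": [1, 2, 3, 4],
    "nibrs_code": ["35A", "90F", "26A", None],
})

codes = ["35A", "35B"]
# str.contains with a "|"-joined pattern matches any of the listed codes;
# na=False treats records with a missing nibrs_code as non-matches
drug_crimes = crimes[crimes["nibrs_code"].str.contains("|".join(codes), na=False)]
```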
# Function that uses the folium library to generate a heat map of the various crimes passed
# as well as labeling the location of each of MCPS high school
def generate_heat_map(crimes, zoom=11, center=[39.1547, -77.2405], include=None):
    map_osm = folium.Map(location=center, zoom_start=zoom, tiles="Stamen Toner")
    for _, school in schools_info.iterrows():
        if include is not None and school["school_name"] not in include:
            continue
        folium.Marker(location=[school["latitude"], school["longitude"]], tooltip=school["school_name"],
                      icon=folium.Icon(color='red')).add_to(map_osm)
    heat_data = [[row['latitude'], row['longitude']] for _, row in crimes.iterrows()]
    HeatMap(heat_data, radius=15).add_to(map_osm)
    return map_osm
All Crimes
generate_heat_map(crime_data)
The above heat map shows crime data as well as the locations of MCPS's high schools. Just from looking at this map, no interesting patterns jump out immediately. It seems like crime levels are generally pretty uniformly distributed around each high school. I wanted to take a closer look and specifically look at crimes against people. I also want to zoom in a bit more to inspect a more detailed heat map.
Mapping Crimes Against People
crimes_against_people = crime_data[crime_data['crimename1'].str.contains("Person", na=False)]
generate_heat_map(crimes_against_people, zoom=13)
This map returned more meaningful results. Now I could clearly see how some high schools such as Wheaton High School had relatively high levels of crimes against people while other high schools such as Walt Whitman High School saw relatively low levels of crime against people.
Crimes against People in the Area Surrounding Wheaton High School
center = list(get_school_info("Wheaton HS")[["latitude", "longitude"]])
generate_heat_map(crimes_against_people, zoom=13, center=center, include=["Wheaton HS"])
Crimes against People in the Area Surrounding Walt Whitman High School
center = list(get_school_info("Walt Whitman HS")[["latitude", "longitude"]])
generate_heat_map(crimes_against_people, zoom=13, center=center, include=["Walt Whitman HS"])
Now that I have visualized a discrepancy in crime surrounding different high schools around the county, I want to begin connecting the dots and see how crime may be affecting key educational outcomes and school profile metrics.
metrics = ['FARMS', 'ATTENDANCE', 'GRADUATION', 'DROPOUT', 'STUD/STAFF RATIO']
for metric in metrics:
    for school in school_data['High School'].unique():
        # Sort the whole frame by year so the x and y values stay aligned
        school_info = school_data[school_data['High School'] == school].sort_values(by='Year')
        plt.plot(school_info['Year'], school_info[metric], label=school)
    plt.title(f'{metric.title()} rate vs. time')
    plt.xlabel('Year')
    plt.ylabel(f'{metric.title()} rate')
    plt.show()
I hypothesize that increased crime in the area surrounding a high school (within a 5-mile radius) negatively impacts all of the school profile metrics under study: graduation rate, attendance rate, dropout rate, and FARMS enrollment. To test my hypothesis, I needed a way to gather all of the crimes in each school's surrounding area. To accomplish this, I created a function that filters the crime data retrieved from the Montgomery County Open Data API down to crimes within a 5-mile radius of the passed-in school.
I used a library called geopy to calculate the distance between two points defined by their latitude and longitude. Geopy computes the geodesic distance between the two points, which accounts for the curvature of the earth. This made finding the distance between schools and crime incidents straightforward, since both of my datasets include coordinates.
def get_crimes_near_school(school_name, nibrs_codes=None, mile_radius=5, crime_list=None):
    if crime_list is None:
        crime_list = crime_data
    # Allow filtering to select only crimes conforming to certain NIBRS codes
    if nibrs_codes is not None:
        crimes = crime_list[crime_list['nibrs_code'].str.contains("|".join(nibrs_codes), na=False)]
    else:
        crimes = crime_list
    school = get_school_info(school_name)
    # Extract school latitude and longitude information
    location = tuple(school[["latitude", "longitude"]])
    crimes = crimes.copy(deep=True)
    for (i, row) in crimes.iterrows():
        crime_location = tuple(row[["latitude", "longitude"]])
        # Mark a crime as near the school if it occurred within mile_radius miles of the school
        crimes.at[i, "near-school"] = geopy.distance.distance(location, crime_location).miles <= mile_radius
    # Return all crimes marked as near the passed-in school
    return crimes[crimes["near-school"] == True]
To do some initial plotting, I took the raw time-based school metric data and averaged it over the entire period for which data was collected, giving a characteristic value for each school. I also added a column with the number of crimes committed within a 5-mile radius of each school. To narrow down the number of crimes, I focused specifically on crimes marked as family offenses (NIBRS code 90F). These crimes, which include neglect and domestic violence, are likely to have the most profound impact on educational outcomes since they often involve minors.
# Collecting a description of all crimes listed as nibrs code 90F (family offenses) in the dataset
crime_data[crime_data['nibrs_code'] == "90F"]["crimename3"].unique()
# Grouping the school data by high school and school info index and taking the mean of the floating point columns
average_school_rates = school_data.groupby(by=["High School", "school_info_index"]).mean().reset_index()
for (i, row) in average_school_rates.iterrows():
    # Get the appropriate school name used by the school info dataframe
    school_name = schools_info.iloc[row["school_info_index"]]["school_name"]
    # Get crimes near the school
    crimes = get_crimes_near_school(school_name, nibrs_codes=['90F'])
    # Count the number of crimes
    numCrimes = crimes["incident_id"].count()
    average_school_rates.at[i, "numCrimes"] = numCrimes
average_school_rates = average_school_rates.sort_values(by="numCrimes")
average_school_rates
I created a helper function that plots a passed-in school profile metric against the number of crimes, which lets me see the relationship between the two if one exists. I also used numpy to add a fitted regression line (linear by default) to the resulting plots.
def plot_school_rates_vs_crime(rate, plt, label_min=None, deg=1, label_max=None, school_labels=None, crime_type="family crimes"):
    x = average_school_rates["numCrimes"]
    y = average_school_rates[rate]
    schools = average_school_rates["High School"]
    coef = np.polynomial.polynomial.polyfit(x, y, deg)
    x_fit = range(int(np.max(x)) + 5)
    # polynomial.polyfit returns ascending-degree coefficients; np.polyval expects descending, hence the reversal
    y_fit = np.polyval(coef[::-1], x_fit)
    plt.plot(x, y, 'yo', x_fit, y_fit, '--k')
    for i, label in enumerate(schools):
        # Use positional (iloc) indexing since average_school_rates was sorted without resetting its index
        if (label_min is not None and y.iloc[i] <= label_min) or (label_max is not None and y.iloc[i] >= label_max) or (school_labels is not None and label in school_labels):
            plt.annotate(label, (x.iloc[i], y.iloc[i]))
    plt.set_title(f'Average {rate.lower()} rate vs. number of {crime_type} (2016-present)')
    plt.set_xlabel(f'Incidence of {crime_type} (2016-present)')
    plt.set_ylabel(f'Average {rate.lower()} rate')
# Allow all plots to share the same x-axis
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, sharex=True)
axs = (ax1, ax2, ax3, ax4)
fig.set_size_inches(10, 20)
plot_school_rates_vs_crime("GRADUATION", ax1)
plot_school_rates_vs_crime("FARMS", ax2)
plot_school_rates_vs_crime("ATTENDANCE", ax3)
plot_school_rates_vs_crime("DROPOUT", ax4)
for ax in axs:
    ax.label_outer()
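One subtlety in the plotting helper above is coefficient ordering: `np.polynomial.polynomial.polyfit` returns coefficients in ascending degree, while `np.polyval` expects them in descending degree, which is why the helper reverses with `coef[::-1]`. A minimal check on exact linear data:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0  # exact line: intercept 1, slope 2

# polynomial.polyfit returns [c0, c1] = [intercept, slope] (ascending powers)
coef = np.polynomial.polynomial.polyfit(x, y, 1)

# np.polyval expects [slope, intercept] (descending powers), so reverse first
y_fit = np.polyval(coef[::-1], x)
```

Forgetting the reversal would silently swap slope and intercept, producing a visibly wrong fit line.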
By looking at the above plots, it's clear that some kind of correlation exists between each of the observed school metrics and family-related crime incidence. Some of these relationships appear linear while others seem to be higher order. For example, average FARMS enrollment seems to increase linearly with crime incidence, while metrics such as dropout rate grow slowly until around 200 crimes, at which point they begin to increase dramatically. To confirm the relationship between the variables, I want to run a linear regression analysis on my data and observe the p-values for the coefficients of the linear model.
# Importing regression model packages
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
metrics = ["GRADUATION", "DROPOUT", "ATTENDANCE", "FARMS"]
for metric in metrics:
    X = average_school_rates['numCrimes']
    X = np.array(X).reshape(-1, 1)
    y = np.array(average_school_rates[metric])
    # Creating a linear regression model
    reg = LinearRegression().fit(X, y)
    # Use statsmodels OLS (with an intercept term) to obtain coefficient p-values
    X = sm.add_constant(X)
    est = sm.OLS(y, X).fit()
    [const_pvalue, x1_pvalue] = est.pvalues
    [const, x1] = est.params
    print(metric)
    print(f"const: {const} (p-value: {const_pvalue})")
    print(f"x1: {x1} (p-value: {x1_pvalue})")
    print(f"R^2: {est.rsquared}")
    print("-----------------------------------------")
The p-values for the linear regression fits fall below the standard significance level $\alpha = 0.05$ for all coefficients except the constant in the FARMS regression. This makes a good case for using a linear regression model as a predictor relating each school metric to the crime incidence in the surrounding area; however, I want to try fitting polynomials of higher degree to see if I can obtain a better fit. The $R^2$ values hover around 0.5, which is not particularly strong and could mean that the true variation is not fully explained by this single variable.
# Plotting fits with degree 2
# Allow all plots to share the same x-axis
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, sharex=True)
axs = (ax1, ax2, ax3, ax4)
fig.set_size_inches(10, 20)
plot_school_rates_vs_crime("GRADUATION", ax1, deg = 2)
plot_school_rates_vs_crime("FARMS", ax2, deg = 2)
plot_school_rates_vs_crime("ATTENDANCE", ax3, deg = 2)
plot_school_rates_vs_crime("DROPOUT", ax4, deg = 2)
for ax in axs:
    ax.label_outer()
The 2nd degree polynomial fits seem to do a better job of fitting the data, especially at higher crime incidence rates: many of the metrics grow or diminish faster as crime incidence increases. One drawback of a quadratic fit is its concavity at low crime incidence rates. For example, it doesn't make much sense that attendance rate would increase at first.
I wanted to see how MCPS may be supporting under-privileged schools by increasing staff support at these schools. In order to get a sense for this I will plot student to staff ratios vs crime incidence.
fig, (ax1) = plt.subplots(1, 1, sharex=True)
fig.set_size_inches(10, 5)
plot_school_rates_vs_crime("STUD/STAFF RATIO", ax1, deg = 1, label_max=14)
metric = "STUD/STAFF RATIO"
X = average_school_rates['numCrimes']
X = np.array(X).reshape(-1, 1)
y = np.array(average_school_rates[metric])
X = sm.add_constant(X)
est = sm.OLS(y, X).fit()
[const_pvalue, x1_pvalue] = est.pvalues
[const, x1] = est.params
print(metric)
print(f"const: {const} (p-value: {const_pvalue})")
print(f"x1: {x1} (p-value: {x1_pvalue})")
print(f"R^2 {est.rsquared}")
print("-----------------------------------------")
There seems to be a negative correlation between average student/staff ratios and crime incidence rates, which suggests that MCPS does supply more staff support to schools in under-privileged areas. However, there are still some anomalies. For example, Damascus HS has a relatively high student/staff ratio compared to other MCPS high schools with similar crime incidence rates. This could indicate either that Damascus HS is being under-served or that it is being supported in other ways that allow it to still perform on par with other MCPS high schools. Let's find out...
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, sharex=True)
axs = (ax1, ax2, ax3, ax4)
fig.set_size_inches(10, 20)
plot_school_rates_vs_crime("GRADUATION", ax1, deg = 1, school_labels=["Damascus HS"])
plot_school_rates_vs_crime("FARMS", ax2, deg = 1, school_labels=["Damascus HS"])
plot_school_rates_vs_crime("ATTENDANCE", ax3, deg = 1, school_labels=["Damascus HS"])
plot_school_rates_vs_crime("DROPOUT", ax4, deg = 1, school_labels=["Damascus HS"])
Looking at how Damascus HS performs with respect to the key school metrics we chose to explore, it is clear that Damascus HS is still performing on par with the rest of MCPS's high schools: it has positive residuals for graduation and attendance rates and negative residuals for FARMS and dropout rates. From this we can likely conclude that Damascus HS has found ways other than increased staff support to improve key school metrics and educational outcomes. Next, I wanted to see whether other categories of crime, such as drug-related and fraud-related offenses, show similar relationships.
drug_crimes = crime_data[crime_data['nibrs_code'].str.contains("|".join(['35A', '35B']), na=False)]
# Take a random sample of 500 crime records from the total time frame to limit processing time
drug_crimes = drug_crimes.sample(500)
for (i, row) in average_school_rates.iterrows():
    # Get the appropriate school name used by the school info dataframe
    school_name = schools_info.iloc[row["school_info_index"]]["school_name"]
    # Get crimes near the school
    crimes = get_crimes_near_school(school_name, nibrs_codes=['35A', '35B'], crime_list=drug_crimes)
    # Count the number of crimes
    numCrimes = crimes["incident_id"].count()
    average_school_rates.at[i, "numCrimes"] = numCrimes
# Allow all plots to share the same x-axis
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, sharex=True)
axs = (ax1, ax2, ax3, ax4)
fig.set_size_inches(10, 20)
plot_school_rates_vs_crime("GRADUATION", ax1, crime_type="drug-related crimes")
plot_school_rates_vs_crime("FARMS", ax2, crime_type="drug-related crimes")
plot_school_rates_vs_crime("ATTENDANCE", ax3, crime_type="drug-related crimes")
plot_school_rates_vs_crime("DROPOUT", ax4, crime_type="drug-related crimes")
for ax in axs:
    ax.label_outer()
metrics = ["GRADUATION", "DROPOUT", "ATTENDANCE", "FARMS"]
for metric in metrics:
    X = average_school_rates['numCrimes']
    X = np.array(X).reshape(-1, 1)
    y = np.array(average_school_rates[metric])
    # Creating a linear regression model
    reg = LinearRegression().fit(X, y)
    # Use statsmodels OLS (with an intercept term) to obtain coefficient p-values
    X = sm.add_constant(X)
    est = sm.OLS(y, X).fit()
    [const_pvalue, x1_pvalue] = est.pvalues
    [const, x1] = est.params
    print(metric)
    print(f"const: {const} (p-value: {const_pvalue})")
    print(f"x1: {x1} (p-value: {x1_pvalue})")
    print(f"R^2: {est.rsquared}")
    print("-----------------------------------------")
fraud_crimes = crime_data[crime_data['nibrs_code'].str.contains("|".join(['26']), na=False)]
# Take a random sample of 500 fraud records to limit processing time
fraud_crimes = fraud_crimes.sample(500)
for (i, row) in average_school_rates.iterrows():
    # Get the appropriate school name used by the school info dataframe
    school_name = schools_info.iloc[row["school_info_index"]]["school_name"]
    # Get crimes near the school
    crimes = get_crimes_near_school(school_name, nibrs_codes=['26'], crime_list=fraud_crimes)
    # Count the number of crimes
    numCrimes = crimes["incident_id"].count()
    average_school_rates.at[i, "numCrimes"] = numCrimes
# Allow all plots to share the same x-axis
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, sharex=True)
axs = (ax1, ax2, ax3, ax4)
fig.set_size_inches(10, 20)
plot_school_rates_vs_crime("GRADUATION", ax1, crime_type="fraud-related crimes")
plot_school_rates_vs_crime("FARMS", ax2, crime_type="fraud-related crimes")
plot_school_rates_vs_crime("ATTENDANCE", ax3, crime_type="fraud-related crimes")
plot_school_rates_vs_crime("DROPOUT", ax4, crime_type="fraud-related crimes")
for ax in axs:
    ax.label_outer()
metrics = ["GRADUATION", "DROPOUT", "ATTENDANCE", "FARMS"]
for metric in metrics:
    X = average_school_rates['numCrimes']
    X = np.array(X).reshape(-1, 1)
    y = np.array(average_school_rates[metric])
    # Creating a linear regression model
    reg = LinearRegression().fit(X, y)
    # Use statsmodels OLS (with an intercept term) to obtain coefficient p-values
    X = sm.add_constant(X)
    est = sm.OLS(y, X).fit()
    [const_pvalue, x1_pvalue] = est.pvalues
    [const, x1] = est.params
    print(metric)
    print(f"const: {const} (p-value: {const_pvalue})")
    print(f"x1: {x1} (p-value: {x1_pvalue})")
    print(f"R^2: {est.rsquared}")
    print("-----------------------------------------")
After looking at the effects of drug-related and fraud-related crimes on the key school performance metrics, it seems that all of these crime categories have similar effects, though some more than others. For example, the $R^2$ values for fraud-related crimes hover around 0.1 while those for drug-related crimes hover around 0.35, both lower than the values for family-related crimes.
From my exploratory data analysis I drew the following observations: while MCPS has deployed greater staff support at underprivileged high schools, these achievement gaps still exist. Incidence of family offenses appears to have a significant impact on key educational outcomes. The question remains whether MCPS can do more to help these students succeed while the odds are stacked against them, or whether this is a more systemic issue that will persist so long as these types of crimes continue to occur.